Airbnb Rio Project - Property Price Prediction Tool for Common Users¶

Context¶

On Airbnb, anyone who has a room or property of any type (apartment, house, cottage, inn, etc.) can offer it for rent on a daily basis.

The host creates a profile and a listing for the property.

In this listing, the host should describe the property's characteristics as comprehensively as possible to help renters/travelers choose the best property for them.

A listing can be customized in many ways, including minimum stay requirements, price, number of rooms, cancellation policies, extra fees for additional guests, identity verification requirements for renters, etc.

The Objective¶

To build a price prediction model that allows common individuals who own a property to determine how much they should charge per night for their property.

Alternatively, for common renters, given the property they are seeking, to help determine if that property is competitively priced (below the average for properties with similar characteristics) or not.

Credits¶

The datasets were obtained from the Kaggle website: https://www.kaggle.com/allanbruno/airbnb-rio-de-janeiro

The datasets contain property prices and their respective characteristics for each month. The prices are given in Brazilian Real (BRL). We have data from April 2018 to May 2020, with the exception of June 2018, which does not have a dataset.

Initial Expectations¶

I believe seasonality can be an important factor, as months like December tend to be quite expensive in Rio de Janeiro. The location of the property should make a significant difference in the price since in Rio de Janeiro, location can completely change the characteristics of a place (safety, natural beauty, tourist attractions). Additional amenities/facilities may have a significant impact, considering the presence of many old buildings and houses in Rio de Janeiro. We will discover how much these factors impact prices and if there are other less intuitive factors that are extremely important.

Importing Libraries and Datasets¶

In [1]:
import pandas as pd
import pathlib
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split

Consolidating Databases¶

In [2]:
months = {'jan': 1, 'feb':2, 'mar':3, 'apr': 4, 'may':5, 'jun': 6, 'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

database_path = pathlib.Path('dataset')

frames = []  #one DataFrame per monthly file

for file in database_path.iterdir():
    month_name = file.name[:3]
    month = months[month_name]
    
    year = file.name[-8:]
    year = int(year.replace('.csv', ''))
    
    #low_memory=False reads each file in a single pass, which avoids the mixed-type DtypeWarning
    df = pd.read_csv(database_path / file.name, low_memory=False)
    df['year'] = year
    df['month'] = month
    frames.append(df)

#DataFrame.append was removed in pandas 2.0, so the monthly frames are concatenated once at the end
base_airbnb = pd.concat(frames)

display(base_airbnb)
id listing_url scrape_id last_scraped name summary space description experiences_offered neighborhood_overview ... minimum_minimum_nights maximum_minimum_nights minimum_maximum_nights maximum_maximum_nights minimum_nights_avg_ntm maximum_nights_avg_ntm number_of_reviews_ltm calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms
0 14063 https://www.airbnb.com/rooms/14063 20180414160018 2018-04-14 Living in a Postcard Besides the most iconic's view, our apartment ... NaN Besides the most iconic's view, our apartment ... none Best and favorite neighborhood of Rio. Perfect... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 17878 https://www.airbnb.com/rooms/17878 20180414160018 2018-04-14 Very Nice 2Br - Copacabana - WiFi Please note that special rates apply for New Y... - large balcony which looks out on pedestrian ... Please note that special rates apply for New Y... none This is the best spot in Rio. Everything happe... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 24480 https://www.airbnb.com/rooms/24480 20180414160018 2018-04-14 Nice and cozy near Ipanema Beach My studio is located in the best of Ipanema. ... The studio is located at Vinicius de Moraes St... My studio is located in the best of Ipanema. ... none The beach, the lagoon, Ipanema is a great loca... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 25026 https://www.airbnb.com/rooms/25026 20180414160018 2018-04-14 Beautiful Modern Decorated Studio in Copa Our apartment is a little gem, everyone loves ... This newly renovated studio (last renovations ... Our apartment is a little gem, everyone loves ... none Copacabana is a lively neighborhood and the ap... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 31560 https://www.airbnb.com/rooms/31560 20180414160018 2018-04-14 NICE & COZY 1BDR - IPANEMA BEACH This nice and clean 1 bedroom apartment is loc... This nice and clean 1 bedroom apartment is loc... This nice and clean 1 bedroom apartment is loc... none Die Nachbarschaft von Ipanema ist super lebend... ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34324 38844730 https://www.airbnb.com/rooms/38844730 20190923212307 2019-09-24 TRANSAMERICA BARRA DA TIJUCA R IV Em estilo contemporâneo, o Transamerica Prime ... NaN Em estilo contemporâneo, o Transamerica Prime ... none NaN ... 1.0 1.0 1125.0 1125.0 1.0 1125.0 0.0 15.0 0.0 0.0
34325 38846408 https://www.airbnb.com/rooms/38846408 20190923212307 2019-09-24 Alugo para o Rock in Rio Confortável apartamento, 2 quartos , sendo 1 s... O apartamento estará com mobília completa disp... Confortável apartamento, 2 quartos , sendo 1 s... none Muito próximo ao Parque Olímpico, local do eve... ... 2.0 2.0 1125.0 1125.0 2.0 1125.0 0.0 1.0 0.0 0.0
34326 38846703 https://www.airbnb.com/rooms/38846703 20190923212307 2019-09-24 Apt COMPLETO em COPACABANA c/TOTAL SEGURANÇA Apartamento quarto e sala COMPLETO para curtas... Espaço ideal para até 5 pessoas. Cama de casal... Apartamento quarto e sala COMPLETO para curtas... none NaN ... 3.0 3.0 1125.0 1125.0 3.0 1125.0 0.0 23.0 6.0 0.0
34327 38847050 https://www.airbnb.com/rooms/38847050 20190923212307 2019-09-24 Cobertura Cinematografica Cobertura alto nivel NaN Cobertura alto nivel none NaN ... 1.0 1.0 1125.0 1125.0 1.0 1125.0 0.0 1.0 0.0 0.0
34328 38847655 https://www.airbnb.com/rooms/38847655 20190923212307 2019-09-24 Quarto em cobertura em frente à praia III Quarto em cobertura quadriplex com vista lindí... NaN Quarto em cobertura quadriplex com vista lindí... none NaN ... 1.0 1.0 30.0 30.0 1.0 30.0 0.0 0.0 4.0 0.0

902210 rows × 108 columns
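
The consolidation loop derives the month and year of each file purely by slicing its name. A minimal sketch of that parsing, assuming file names of the form month abbreviation followed by a four-digit year and `.csv`; the name `apr2018.csv` and the helper `parse_month_year` are hypothetical, chosen only to match the slicing logic:

```python
import pathlib

months = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
          'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

def parse_month_year(file_name):
    """Return (month, year) from a name like 'apr2018.csv' (assumed naming scheme)."""
    month = months[file_name[:3]]                    # first three letters -> month number
    year = int(file_name[-8:].replace('.csv', ''))   # last eight characters are 'YYYY.csv'
    return month, year

# Hypothetical file name, for illustration only
example = pathlib.Path('dataset') / 'apr2018.csv'
print(parse_month_year(example.name))  # (4, 2018)
```

Because the slicing depends on the position of the characters, a file whose name does not end in a four-digit year plus `.csv` would raise an error rather than be silently mislabeled.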

Data Cleaning¶

Now let's begin the data preprocessing phase. Since we have too many columns, our model may become slow, so we have to identify which columns we can exclude¶

Moreover, a quick analysis reveals that several columns are not necessary for our prediction model. Therefore, we will exclude some columns from our dataset.

Types of columns we will exclude:

  • IDs, links, and other information irrelevant to the model
  • Repeated or extremely similar columns that give the model the same information (e.g., Date vs. Year/Month)
  • Columns filled with free text, since we won't run any word analysis or similar processes
  • Columns where all or almost all values are the same

To do this, we will export the first 1,000 records to a CSV file (which can be opened in Excel) and perform a qualitative analysis, examining the columns and identifying which ones are unnecessary.
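
The criterion about columns where all or almost all values are the same can also be checked programmatically, as a complement to the manual review. A minimal sketch on toy data; the helper name `near_constant_columns` and the 0.95 threshold are assumptions for illustration, not part of the original analysis:

```python
import pandas as pd

def near_constant_columns(df, threshold=0.95):
    """Return columns whose most frequent value covers at least `threshold` of the rows."""
    flagged = []
    for column in df.columns:
        # Share of rows taken by the most common value (NaN is counted as a value)
        top_share = df[column].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(column)
    return flagged

# Toy data: 'country' never varies, 'price' always does
toy = pd.DataFrame({
    'country': ['Brazil'] * 20,
    'price': range(20),
})
print(near_constant_columns(toy))  # ['country']
```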

In [3]:
print(list(base_airbnb.columns))
base_airbnb.head(1000).to_csv('first_registers.csv')
['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary', 'space', 'description', 'experiences_offered', 'neighborhood_overview', 'notes', 'transit', 'access', 'interaction', 'house_rules', 'thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'street', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market', 'smart_location', 'country_code', 'country', 'latitude', 'longitude', 'is_location_exact', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet', 'price', 'weekly_price', 'monthly_price', 'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 'maximum_nights', 'calendar_updated', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'calendar_last_scraped', 'number_of_reviews', 'first_review', 'last_review', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'requires_license', 'license', 'jurisdiction_names', 'instant_bookable', 'is_business_travel_ready', 'cancellation_policy', 'require_guest_profile_picture', 'require_guest_phone_verification', 'calculated_host_listings_count', 'reviews_per_month', 'year', 'month', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'number_of_reviews_ltm', 'calculated_host_listings_count_entire_homes', 
'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms']

After the qualitative analysis of the columns, considering the criteria explained above, we are left with the following columns¶

In [4]:
columns = ['host_response_time','host_response_rate','host_is_superhost','host_listings_count','latitude','longitude','property_type','room_type','accommodates','bathrooms','bedrooms','beds','bed_type','amenities','price','security_deposit','cleaning_fee','guests_included','extra_people','minimum_nights','maximum_nights','number_of_reviews','review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin','review_scores_communication','review_scores_location','review_scores_value','instant_bookable','is_business_travel_ready','cancellation_policy','year','month']

base_airbnb = base_airbnb.loc[:, columns]
print(list(base_airbnb.columns))
display(base_airbnb)
['host_response_time', 'host_response_rate', 'host_is_superhost', 'host_listings_count', 'latitude', 'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'price', 'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 'maximum_nights', 'number_of_reviews', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'instant_bookable', 'is_business_travel_ready', 'cancellation_policy', 'year', 'month']
host_response_time host_response_rate host_is_superhost host_listings_count latitude longitude property_type room_type accommodates bathrooms ... review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value instant_bookable is_business_travel_ready cancellation_policy year month
0 NaN NaN f 1.0 -22.946854 -43.182737 Apartment Entire home/apt 4 1.0 ... 9.0 9.0 9.0 9.0 9.0 f f strict_14_with_grace_period 2018 4
1 within an hour 100% t 2.0 -22.965919 -43.178962 Condominium Entire home/apt 5 1.0 ... 9.0 10.0 10.0 9.0 9.0 t f strict 2018 4
2 within an hour 100% f 1.0 -22.985698 -43.201935 Apartment Entire home/apt 2 1.0 ... 10.0 10.0 10.0 10.0 9.0 f f strict 2018 4
3 within an hour 100% f 3.0 -22.977117 -43.190454 Apartment Entire home/apt 3 1.0 ... 10.0 10.0 10.0 10.0 9.0 f f strict 2018 4
4 within an hour 100% t 1.0 -22.983024 -43.214270 Apartment Entire home/apt 3 1.0 ... 10.0 10.0 10.0 10.0 9.0 t f strict 2018 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
34324 within an hour 93% f 0.0 -23.003180 -43.342840 Apartment Entire home/apt 4 1.0 ... NaN NaN NaN NaN NaN f f flexible 2019 9
34325 NaN NaN f 0.0 -22.966640 -43.393450 Apartment Entire home/apt 4 2.0 ... NaN NaN NaN NaN NaN f f flexible 2019 9
34326 within a few hours 74% f 32.0 -22.962080 -43.175520 Apartment Entire home/apt 5 1.0 ... NaN NaN NaN NaN NaN f f strict_14_with_grace_period 2019 9
34327 NaN NaN f 0.0 -23.003400 -43.341820 Apartment Entire home/apt 4 1.0 ... NaN NaN NaN NaN NaN f f strict_14_with_grace_period 2019 9
34328 a few days or more 38% f 5.0 -23.010560 -43.363350 Apartment Private room 2 0.0 ... NaN NaN NaN NaN NaN f f strict_14_with_grace_period 2019 9

902210 rows × 34 columns

Handling Missing Values (NaN)¶

  • Inspecting the data shows that missing values are very unevenly distributed across columns. Columns with more than 300,000 NaN values (roughly a third of the rows) are dropped.
  • For the remaining columns, since we have a large amount of data (over 900,000 rows), we will exclude the rows that contain NaN values
In [5]:
for column in base_airbnb:
    if base_airbnb[column].isnull().sum() > 300000:
        base_airbnb = base_airbnb.drop(column, axis=1)
print(base_airbnb.isnull().sum())
host_is_superhost            460
host_listings_count          460
latitude                       0
longitude                      0
property_type                  0
room_type                      0
accommodates                   0
bathrooms                   1724
bedrooms                     850
beds                        2502
bed_type                       0
amenities                      0
price                          0
guests_included                0
extra_people                   0
minimum_nights                 0
maximum_nights                 0
number_of_reviews              0
instant_bookable               0
is_business_travel_ready       0
cancellation_policy            0
year                           0
month                          0
dtype: int64
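
The same column filter can be written without a loop by counting NaN values once and dropping every column above the threshold. A minimal sketch on toy data; the threshold of 2 is a stand-in for the 300,000 used in this notebook, and the column names are made up:

```python
import numpy as np
import pandas as pd

# Toy data: 'sparse' is mostly NaN, 'dense' has a single gap, 'full' has none
toy = pd.DataFrame({
    'sparse': [np.nan, np.nan, np.nan, 1.0],
    'dense': [1.0, 2.0, np.nan, 4.0],
    'full': [1, 2, 3, 4],
})

max_nan = 2  # hypothetical threshold; the notebook uses 300000

nan_counts = toy.isnull().sum()                          # NaN count per column
columns_to_drop = nan_counts[nan_counts > max_nan].index
cleaned = toy.drop(columns=columns_to_drop)

print(list(cleaned.columns))  # ['dense', 'full']
```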
  • Dropping the rows that contain NaN values:
In [6]:
base_airbnb = base_airbnb.dropna()

print(base_airbnb.shape)
print(base_airbnb.isnull().sum())
(897709, 23)
host_is_superhost           0
host_listings_count         0
latitude                    0
longitude                   0
property_type               0
room_type                   0
accommodates                0
bathrooms                   0
bedrooms                    0
beds                        0
bed_type                    0
amenities                   0
price                       0
guests_included             0
extra_people                0
minimum_nights              0
maximum_nights              0
number_of_reviews           0
instant_bookable            0
is_business_travel_ready    0
cancellation_policy         0
year                        0
month                       0
dtype: int64

Checking Data Types in Each Column¶

In [7]:
print(base_airbnb.dtypes) #Printing the data types
print('-'*60) 
print(base_airbnb.iloc[0]) #Visualizing only the first row of each column to analyze the content
host_is_superhost            object
host_listings_count         float64
latitude                    float64
longitude                   float64
property_type                object
room_type                    object
accommodates                  int64
bathrooms                   float64
bedrooms                    float64
beds                        float64
bed_type                     object
amenities                    object
price                        object
guests_included               int64
extra_people                 object
minimum_nights                int64
maximum_nights                int64
number_of_reviews             int64
instant_bookable             object
is_business_travel_ready     object
cancellation_policy          object
year                          int64
month                         int64
dtype: object
------------------------------------------------------------
host_is_superhost                                                           f
host_listings_count                                                       1.0
latitude                                                           -22.946854
longitude                                                          -43.182737
property_type                                                       Apartment
room_type                                                     Entire home/apt
accommodates                                                                4
bathrooms                                                                 1.0
bedrooms                                                                  0.0
beds                                                                      2.0
bed_type                                                             Real Bed
amenities                   {TV,Internet,"Air conditioning",Kitchen,Doorma...
price                                                                 $133.00
guests_included                                                             2
extra_people                                                           $34.00
minimum_nights                                                             60
maximum_nights                                                            365
number_of_reviews                                                          38
instant_bookable                                                            f
is_business_travel_ready                                                    f
cancellation_policy                               strict_14_with_grace_period
year                                                                     2018
month                                                                       4
Name: 0, dtype: object
  • Since price and extra_people are stored as objects (strings such as $133.00) rather than floats, we need to convert both columns to a numeric type.
In [8]:
#price: strip the currency symbol and the thousands separator, then convert to float
#regex=False treats '$' as a literal character instead of a regular expression
base_airbnb['price'] = base_airbnb['price'].str.replace('$', '', regex=False)
base_airbnb['price'] = base_airbnb['price'].str.replace(',', '', regex=False)
base_airbnb['price'] = base_airbnb['price'].astype(np.float32, copy=False)

#extra_people
base_airbnb['extra_people'] = base_airbnb['extra_people'].str.replace('$', '', regex=False)
base_airbnb['extra_people'] = base_airbnb['extra_people'].str.replace(',', '', regex=False)
base_airbnb['extra_people'] = base_airbnb['extra_people'].astype(np.float32, copy=False)


#verifying the updated data
print(base_airbnb.dtypes)
host_is_superhost            object
host_listings_count         float64
latitude                    float64
longitude                   float64
property_type                object
room_type                    object
accommodates                  int64
bathrooms                   float64
bedrooms                    float64
beds                        float64
bed_type                     object
amenities                    object
price                       float32
guests_included               int64
extra_people                float32
minimum_nights                int64
maximum_nights                int64
number_of_reviews             int64
instant_bookable             object
is_business_travel_ready     object
cancellation_policy          object
year                          int64
month                         int64
dtype: object
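
Since the same conversion is applied to both monetary columns, it could be wrapped in a small helper. A minimal sketch on toy data; `currency_to_float` is a hypothetical name, not a function from this notebook:

```python
import numpy as np
import pandas as pd

def currency_to_float(series):
    """Convert strings such as '$1,250.00' to float32 (hypothetical helper)."""
    cleaned = (series.str.replace('$', '', regex=False)    # drop the currency symbol
                     .str.replace(',', '', regex=False))   # drop the thousands separator
    return cleaned.astype(np.float32)

toy_prices = pd.Series(['$133.00', '$1,250.00', '$34.00'])
converted = currency_to_float(toy_prices)

print(converted.tolist())  # [133.0, 1250.0, 34.0]
print(converted.dtype)     # float32
```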

Exploratory Analysis and Outlier Treatment¶

  • It's needed to examine feature by feature to:

    1. Assess the correlation between the features and decide whether to keep all the features we have.
    2. Exclude outliers (using the rule: values below Q1 - 1.5 x Amplitude and values above Q3 + 1.5 x Amplitude). Amplitude = Q3 - Q1.
    3. Confirm whether all the features we have actually make sense for our model or if any of them will not be helpful and should be excluded.

  • Let's start with the price column (the target we want to predict) and extra_people (also a monetary value). Both are continuous numerical values.

  • Then I will analyze columns with discrete numerical values (accommodates, bedrooms, guests_included, etc.).

  • Finally, I will evaluate text columns and determine which categories make sense to keep or discard.

In [9]:
plt.figure(figsize=(15, 10))
# numeric_only=True restricts the correlation to numeric columns and avoids the FutureWarning
sns.heatmap(base_airbnb.corr(numeric_only=True), annot=True, cmap='Blues')
Out[9]:
<Axes: >
  • Since the correlations between the columns are below 0.70, I will treat the columns as sufficiently distinct from each other. If any pair had a higher correlation, I would have to consider excluding one of them.
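
The 0.70 check can also be done without reading the heatmap by eye. The sketch below is only an illustration on a made-up DataFrame (the `toy` data and the variable names are assumptions, not part of the project): it keeps the numeric columns, computes the absolute correlations, and lists the pairs above the threshold.

```python
import numpy as np
import pandas as pd

# Made-up data: 'beds' closely follows 'bedrooms', 'minimum_nights' does not
toy = pd.DataFrame({
    'bedrooms': [1, 2, 3, 4, 5],
    'beds': [1, 2, 3, 5, 6],
    'minimum_nights': [3, 1, 4, 1, 5],
    'room_type': ['a', 'b', 'a', 'b', 'a'],  # text column, ignored below
})

# Absolute correlation between numeric columns only
corr = toy.select_dtypes('number').corr().abs()

# Keep the upper triangle so each pair appears once and the diagonal is skipped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
high = corr.where(mask).stack()
high = high[high > 0.70]

print(high)
```

On this toy data only the (bedrooms, beds) pair is reported. An empty result on the real data would match the conclusion above.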

Definition of Functions for Outlier Analysis¶

  • Now I'm going to define some functions to assist in the analysis of outliers in the columns.
In [10]:
def fences(column):
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1              #iqr = interquartile range
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return lower_fence, upper_fence

def remove_outliers(df, column_name):
    qty_rows = df.shape[0]
    lower_fence, upper_fence = fences(df[column_name])
    df = df.loc[(df[column_name] >= lower_fence) & (df[column_name] <= upper_fence), :]
    removed_rows = qty_rows - df.shape[0]
    return df,  removed_rows
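
To make the rule concrete, here is a small worked example on made-up prices (the `toy` values are an assumption for illustration only). It repeats the same fence calculation so that it can run on its own.

```python
import pandas as pd

def fences(column):
    # Same rule as above: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Made-up nightly prices in BRL; 5000 is an obvious outlier
toy = pd.DataFrame({'price': [100, 120, 150, 180, 200, 220, 250, 5000]})

lower, upper = fences(toy['price'])
kept = toy.loc[toy['price'].between(lower, upper)]

print(lower, upper)                        # 15.0 355.0
print(len(toy) - len(kept), 'row removed')  # 1 row removed
```

Here Q1 = 142.5 and Q3 = 227.5, so the IQR is 85 and the fences are 15.0 and 355.0. Only the 5000 listing falls outside them.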
In [11]:
def boxplot(column):
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(15, 5)          #Defining the size of boxplots
    sns.boxplot(x=column, ax=ax1)
    ax2.set_xlim(fences(column))  #The second boxplot will show only the fence range, without showing the outliers.
    sns.boxplot(x=column, ax=ax2)
    
def histogram(column):
    plt.figure(figsize=(15, 5))
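    # distplot is deprecated in seaborn; histplot with kde=True draws a comparable histogram plus density curve
    sns.histplot(column, kde=True, stat='density')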
    
def bar_chart(column):  
    plt.figure(figsize=(15, 5))
    ax = sns.barplot(x=column.value_counts().index, y=column.value_counts())
    ax.set_xlim(fences(column))

Analyzing Price¶

In [12]:
boxplot(base_airbnb['price'])
histogram(base_airbnb['price'])

Since I'm building a model for regular residential properties, I believe that values above the upper fence represent only extremely luxurious apartments, which are not my main focus. Therefore, I will exclude these outliers.

In [13]:
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'price')
print('{} rows were removed'.format(removed_rows))
87282 rows were removed
In [14]:
histogram(base_airbnb['price'])
print(base_airbnb.shape)
(810427, 23)

Analyzing extra_people¶

In [15]:
boxplot(base_airbnb['extra_people'])
histogram(base_airbnb['extra_people'])

I'm removing the outliers from this column too, because the values above the upper fence are extreme.

In [16]:
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'extra_people')
print('{} rows were removed'.format(removed_rows))
59194 rows were removed
In [17]:
histogram(base_airbnb['extra_people'])
print(base_airbnb.shape)
(751233, 23)

Analyzing host_listings_count¶

In [18]:
boxplot(base_airbnb['host_listings_count'])
bar_chart(base_airbnb['host_listings_count'])

We can exclude the outliers because, for the purpose of our project, hosts with more than 6 properties on Airbnb are not the target audience. I imagine they might be real estate investors or professionals managing properties on Airbnb.

In [19]:
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'host_listings_count')
print('{} rows were removed'.format(removed_rows))
97723 rows were removed

Analyzing accommodates¶

In [20]:
boxplot(base_airbnb['accommodates'])
bar_chart(base_airbnb['accommodates'])
In [21]:
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'accommodates')
print('{} rows were removed'.format(removed_rows))
13146 rows were removed

Analyzing bathrooms¶

In [22]:
boxplot(base_airbnb['bathrooms'])
plt.figure(figsize=(15, 5))
sns.barplot(x=base_airbnb['bathrooms'].value_counts().index, y=base_airbnb['bathrooms'].value_counts())
Out[22]:
<Axes: ylabel='bathrooms'>
In [23]:
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'bathrooms')
print('{} rows were removed'.format(removed_rows))
6894 rows were removed

Analyzing bedrooms¶

In [24]:
boxplot(base_airbnb['bedrooms'])
bar_chart(base_airbnb['bedrooms'])
In [25]:
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'bedrooms')
print('{} rows were removed'.format(removed_rows))
5482 rows were removed

Analyzing beds¶

In [26]:
boxplot(base_airbnb['beds'])
bar_chart(base_airbnb['beds'])
In [27]:
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'beds')
print('{} rows were removed'.format(removed_rows))
5622 rows were removed

Analyzing guests_included¶

In [28]:
#boxplot(base_airbnb['guests_included'])
#bar_chart(base_airbnb['guests_included'])
print(fences(base_airbnb['guests_included']))
plt.figure(figsize=(15, 5))
sns.barplot(x=base_airbnb['guests_included'].value_counts().index, y=base_airbnb['guests_included'].value_counts())
(1.0, 1.0)
Out[28]:
<Axes: ylabel='guests_included'>

I'm removing this feature from the analysis. The fences come out as (1.0, 1.0), which means Q1 and Q3 are both 1: most hosts appear to leave the default value of 1 guest included. This could lead the model to rely on a feature that is not actually essential for determining the price, so it seems better to exclude the column from the analysis.
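
A quick sketch of why the fences collapse to a single value when one value dominates a column. The `guests` series below is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Made-up column where 8 out of 10 hosts keep the default of 1
guests = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 2, 4])

q1 = guests.quantile(0.25)
q3 = guests.quantile(0.75)
iqr = q3 - q1                      # 0.0, because Q1 and Q3 are both 1
fence_pair = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
share_default = (guests == 1).mean()

print(fence_pair)      # (1.0, 1.0)
print(share_default)   # 0.8
```

With zero-width fences, applying the outlier rule would discard every row whose value is not 1, which is why dropping the column is the more sensible option here.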

In [29]:
base_airbnb = base_airbnb.drop('guests_included', axis=1)
base_airbnb.shape
Out[29]:
(622366, 22)

Analyzing minimum_nights¶

In [30]:
boxplot(base_airbnb['minimum_nights'])
bar_chart(base_airbnb['minimum_nights'])
  • Here I have an even stronger reason to exclude these apartments from the analysis.

  • I'm aiming to build a model that helps price regular apartments as an average person would like to list them. In the case of apartments with a "minimum nights" value greater than 8, they could be seasonal rentals or apartments for long-term living where the host requires a minimum stay of at least one month.

  • Therefore, let's exclude the outliers from this column.

In [31]:
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'minimum_nights')
print('{} rows were removed'.format(removed_rows))
40383 rows were removed

Analyzing maximum_nights¶

In [32]:
boxplot(base_airbnb['maximum_nights'])
bar_chart(base_airbnb['maximum_nights'])
  • This column doesn't seem like it will contribute to the analysis.

  • That's because it appears that nearly all hosts do not fill in the "maximum nights" field, so it doesn't seem to be a relevant factor.

  • It's better to exclude this column from the analysis.

In [33]:
base_airbnb = base_airbnb.drop('maximum_nights', axis=1)
base_airbnb.shape
Out[33]:
(581983, 21)

Analyzing number_of_reviews¶

In [34]:
boxplot(base_airbnb['number_of_reviews'])
bar_chart(base_airbnb['number_of_reviews'])
  • Different approaches are possible here. Based on my own analysis, I decided to exclude this feature, for a few reasons:
  1. If I exclude the outliers, I will also exclude hosts with the highest number of reviews (which are typically the hosts with more rentals). This could have a significantly negative impact on our model.
  2. Considering my objective, if I have a vacant property and want to list it, it's expected that I wouldn't have any reviews. Therefore, excluding this feature from the analysis might actually be beneficial.
  3. Personally, I'm uncertain whether this feature should impact the final price or not.

Treatment of Text Value Columns¶

Analyzing property_type¶

In [35]:
print(base_airbnb['property_type'].value_counts())
plt.figure(figsize=(15, 5))
chart = sns.countplot(x='property_type', data=base_airbnb)
chart.tick_params(axis='x', rotation=90)
Apartment                 458354
House                      51387
Condominium                26456
Serviced apartment         12671
Loft                       12352
Guest suite                 3621
Bed and breakfast           3472
Hostel                      2665
Guesthouse                  2155
Other                       1957
Villa                       1294
Townhouse                    969
Aparthotel                   693
Chalet                       481
Earth house                  468
Tiny house                   457
Boutique hotel               447
Hotel                        376
Casa particular (Cuba)       298
Cottage                      230
Bungalow                     207
Dorm                         185
Cabin                        141
Nature lodge                 124
Castle                        80
Treehouse                     76
Island                        54
Boat                          53
Hut                           40
Campsite                      34
Resort                        31
Camper/RV                     24
Yurt                          23
Tent                          18
Tipi                          17
Barn                          15
Farm stay                     13
Pension (South Korea)          9
Dome house                     8
Igloo                          6
In-law                         6
Vacation home                  4
Timeshare                      3
Pousada                        3
Houseboat                      3
Casa particular                2
Plane                          1
Name: property_type, dtype: int64
  • Here, my action is not to "exclude outliers", but rather to group categories with very few occurrences.

  • I will group all property types with fewer than 2,000 occurrences in the database into a single category called "Others". I believe this will simplify the model.

In [36]:
home_type_table = base_airbnb['property_type'].value_counts()
group_columns = []

for types in home_type_table.index:
    if home_type_table[types] < 2000:
        group_columns.append(types)
print(group_columns)

for types in group_columns:
    base_airbnb.loc[base_airbnb['property_type']==types, 'property_type'] = 'Others'

print(base_airbnb['property_type'].value_counts())
plt.figure(figsize=(15, 5))
chart = sns.countplot(x='property_type', data=base_airbnb)
chart.tick_params(axis='x', rotation=90)
['Other', 'Villa', 'Townhouse', 'Aparthotel', 'Chalet', 'Earth house', 'Tiny house', 'Boutique hotel', 'Hotel', 'Casa particular (Cuba)', 'Cottage', 'Bungalow', 'Dorm', 'Cabin', 'Nature lodge', 'Castle', 'Treehouse', 'Island', 'Boat', 'Hut', 'Campsite', 'Resort', 'Camper/RV', 'Yurt', 'Tent', 'Tipi', 'Barn', 'Farm stay', 'Pension (South Korea)', 'Dome house', 'Igloo', 'In-law', 'Vacation home', 'Timeshare', 'Pousada', 'Houseboat', 'Casa particular', 'Plane']
Apartment             458354
House                  51387
Condominium            26456
Serviced apartment     12671
Loft                   12352
Others                  8850
Guest suite             3621
Bed and breakfast       3472
Hostel                  2665
Guesthouse              2155
Name: property_type, dtype: int64
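
The same grouping can be written without explicit loops. The sketch below is only an alternative illustration: `group_rare` is a hypothetical helper name, and the toy data and threshold are made up.

```python
import pandas as pd

def group_rare(series, min_count, other_label='Others'):
    """Replace categories that occur fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare, replace the rest
    return series.where(~series.isin(rare), other_label)

# Made-up property types: two of them appear only once
toy = pd.Series(['Apartment'] * 5 + ['House'] * 3 + ['Castle', 'Igloo'])
grouped = group_rare(toy, min_count=3)

print(grouped.value_counts())
```

On the toy series, Castle and Igloo are merged into Others while Apartment and House are kept. Because the label is a parameter, the same helper could also cover a case where rare categories are folded into an existing category.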

Analyzing room_type¶

In [37]:
print(base_airbnb['room_type'].value_counts())
plt.figure(figsize=(15, 5))
chart = sns.countplot(x='room_type', data=base_airbnb)
chart.tick_params(axis='x', rotation=90)
Entire home/apt    372443
Private room       196859
Shared room         11714
Hotel room            967
Name: room_type, dtype: int64

Analyzing bed_type¶

In [38]:
print(base_airbnb['bed_type'].value_counts())
plt.figure(figsize=(15, 5))
chart = sns.countplot(x='bed_type', data=base_airbnb)
chart.tick_params(axis='x', rotation=90)
Real Bed         570643
Pull-out Sofa      8055
Futon              1634
Airbed             1155
Couch               496
Name: bed_type, dtype: int64
In [39]:
# grouping categories of bed_type
bed_table = base_airbnb['bed_type'].value_counts()
group_columns = []

for types in bed_table.index:
    if bed_table[types] < 10000:
        group_columns.append(types)
print(group_columns)

for types in group_columns:
    base_airbnb.loc[base_airbnb['bed_type']==types, 'bed_type'] = 'Others'

print(base_airbnb['bed_type'].value_counts())
plt.figure(figsize=(15, 5))
chart = sns.countplot(x='bed_type', data=base_airbnb)
chart.tick_params(axis='x', rotation=90)
['Pull-out Sofa', 'Futon', 'Airbed', 'Couch']
Real Bed    570643
Others       11340
Name: bed_type, dtype: int64

Analyzing cancellation_policy¶

In [40]:
# grouping categories of cancellation_policy
cancellation_table = base_airbnb['cancellation_policy'].value_counts()
group_columns = []

for types in cancellation_table.index:
    if cancellation_table[types] < 10000:
        group_columns.append(types)
print(group_columns)

for types in group_columns:
    base_airbnb.loc[base_airbnb['cancellation_policy']==types, 'cancellation_policy'] = 'strict'

print(base_airbnb['cancellation_policy'].value_counts())
plt.figure(figsize=(15, 5))
chart = sns.countplot(x='cancellation_policy', data=base_airbnb)
chart.tick_params(axis='x', rotation=90)
['strict', 'super_strict_60', 'super_strict_30']
flexible                       258096
strict_14_with_grace_period    200743
moderate                       113281
strict                           9863
Name: cancellation_policy, dtype: int64

Analyzing amenities¶

Since there is a wide variety of amenities, and the same amenity can sometimes be written in different ways, I will use the number of amenities as the feature for the model.

In [41]:
base_airbnb.shape
Out[41]:
(581983, 21)
In [42]:
base_airbnb['n_amenities'] = base_airbnb['amenities'].str.split(',').apply(len)
In [43]:
base_airbnb = base_airbnb.drop('amenities', axis=1)
base_airbnb.shape
Out[43]:
(581983, 21)
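
To show what this count does, here is a sketch on made-up strings. The brace-delimited format used below is an assumption about how the amenities column is stored, since the raw values are not shown here.

```python
import pandas as pd

# Made-up amenities strings (assumed format), including an empty list
toy = pd.Series(['{TV,Wifi,"Air conditioning"}', '{Kitchen}', '{}'])

# Same approach as above: split on commas and count the pieces
n_amenities = toy.str.split(',').apply(len)
print(n_amenities.tolist())   # [3, 1, 1]

# Variant that strips the braces first, so an empty list counts as 0
stripped = toy.str.strip('{}')
n_strict = stripped.apply(lambda s: len(s.split(',')) if s else 0)
print(n_strict.tolist())      # [3, 1, 0]
```

Under this assumed format, a listing with no amenities would still be counted as 1 by the simple split. Whether that matters depends on how many listings have an empty list.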

Now we can analyze the column n_amenities just like the way the other numerical columns were analyzed:

In [44]:
boxplot(base_airbnb['n_amenities'])
bar_chart(base_airbnb['n_amenities'])
In [45]:
base_airbnb, removed_rows = remove_outliers(base_airbnb, 'n_amenities')
print('{} rows were removed'.format(removed_rows))
24343 rows were removed

Property Map Visualization¶

I'm now creating a map that displays a random subset of our database (50,000 properties) to see how the properties are distributed throughout the city and also identify areas with higher prices.

In [46]:
sample = base_airbnb.sample(n=50000)
map_center = {'lat':sample.latitude.mean(), 'lon':sample.longitude.mean()}
map_chart = px.density_mapbox(sample, lat='latitude', lon='longitude',z='price', radius=2.5,
                        center=map_center, zoom=10,
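                        # Stamen tiles may require a Stadia Maps API key in recent Plotly versions;
                        # 'open-street-map' is a token-free alternative if the map fails to render.
                        mapbox_style='stamen-terrain')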
map_chart.show()

Encoding¶

I will now adjust the features to facilitate the work of the future model.

  • For true/false columns (stored as 't' and 'f'), I will replace 't' with 1 and 'f' with 0.

  • For categorical features (features where the column values are texts), I will use the method of encoding variables as dummies.

In [47]:
#Replacing true with 1 and false with 0
tf_columns = ['host_is_superhost', 'instant_bookable', 'is_business_travel_ready']
base_airbnb_cod = base_airbnb.copy()
for column in tf_columns:
    base_airbnb_cod.loc[base_airbnb_cod[column]=='t', column] = 1
    base_airbnb_cod.loc[base_airbnb_cod[column]=='f', column] = 0
    
print(base_airbnb_cod.iloc[0])
host_is_superhost                         1
host_listings_count                     2.0
latitude                         -22.965919
longitude                        -43.178962
property_type                   Condominium
room_type                   Entire home/apt
accommodates                              5
bathrooms                               1.0
bedrooms                                2.0
beds                                    2.0
bed_type                           Real Bed
price                                 270.0
extra_people                           51.0
minimum_nights                            4
number_of_reviews                       205
instant_bookable                          1
is_business_travel_ready                  0
cancellation_policy                  strict
year                                   2018
month                                     4
n_amenities                              25
Name: 1, dtype: object
In [48]:
#Method of encoding variables as dummies
columns_categories = ['property_type', 'room_type', 'bed_type', 'cancellation_policy']
base_airbnb_cod = pd.get_dummies(data=base_airbnb_cod, columns=columns_categories)
display(base_airbnb_cod.head())
host_is_superhost host_listings_count latitude longitude accommodates bathrooms bedrooms beds price extra_people ... room_type_Entire home/apt room_type_Hotel room room_type_Private room room_type_Shared room bed_type_Others bed_type_Real Bed cancellation_policy_flexible cancellation_policy_moderate cancellation_policy_strict cancellation_policy_strict_14_with_grace_period
1 1 2.0 -22.965919 -43.178962 5 1.0 2.0 2.0 270.0 51.0 ... 1 0 0 0 0 1 0 0 1 0
3 0 3.0 -22.977117 -43.190454 3 1.0 1.0 2.0 161.0 45.0 ... 1 0 0 0 0 1 0 0 1 0
4 1 1.0 -22.983024 -43.214270 3 1.0 1.0 2.0 222.0 68.0 ... 1 0 0 0 0 1 0 0 1 0
5 1 1.0 -22.988165 -43.193588 3 1.5 1.0 2.0 308.0 86.0 ... 1 0 0 0 0 1 0 0 1 0
6 1 1.0 -22.981269 -43.190457 2 1.0 1.0 2.0 219.0 80.0 ... 1 0 0 0 0 1 0 0 1 0

5 rows × 37 columns
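
The two encoding steps can be seen more easily on a tiny example. The DataFrame below is made up, and using `map` plus `dtype=int` is just one possible way to write it, not the code used above.

```python
import pandas as pd

# Made-up rows with one true/false column and one categorical column
toy = pd.DataFrame({
    'instant_bookable': ['t', 'f', 't'],
    'room_type': ['Entire home/apt', 'Private room', 'Entire home/apt'],
    'price': [270.0, 161.0, 222.0],
})

encoded = toy.copy()
# Step 1: 't' becomes 1 and 'f' becomes 0
encoded['instant_bookable'] = encoded['instant_bookable'].map({'t': 1, 'f': 0})
# Step 2: one dummy column per category; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(encoded, columns=['room_type'], dtype=int)

print(encoded)
```

Each category of room_type becomes its own 0/1 column, named with the original column as a prefix, which is the same pattern seen in the output above.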

Prediction Model and Analysis of the Best Model¶

  • Evaluation Metrics

Here I will use the R² metric, which tells how well the model can explain the price. This will be a great parameter to assess the model quality.

-> The closer to 100%, the better.

I will also calculate the Root-Mean-Squared Error (RMSE), which will show us how much the model is deviating from the actual values.

-> The smaller the error, the better
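
As a small worked example of both metrics, the sketch below computes them from their definitions on made-up values (four listings whose predictions are each off by 10 BRL). The numbers are assumptions chosen to keep the arithmetic easy to follow.

```python
import numpy as np

# Made-up actual and predicted prices in BRL
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# R² = 1 - (sum of squared errors) / (total variation around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)          # 400.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # 50000.0
r2 = 1 - ss_res / ss_tot

# RMSE = square root of the mean squared error, in the same unit as the price
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(f'R²: {r2:.2%}')      # R²: 99.20%
print(f'RMSE: {rmse:.2f}')  # RMSE: 10.00
```

Since the RMSE is expressed in BRL, it can be read directly as a typical size of the error per night.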

In [49]:
def evaluate_model(model_name, y_test, prediction):
    r2 = r2_score(y_test, prediction)
    RMSE = np.sqrt(mean_squared_error(y_test, prediction))
    return f'Model {model_name}:\nR²:{r2:.2%}\nRMSE:{RMSE:.2f}'
  • Selection of Models to be tested
    1. RandomForest
    2. LinearRegression
    3. ExtraTrees

These are some of the models available for predicting a numerical value (regression). Since the goal is to predict the price, which is a numerical value, I have chosen these three models.

In [50]:
rf_model = RandomForestRegressor()
lr_model = LinearRegression()
et_model = ExtraTreesRegressor()

models = {'RandomForest': rf_model,
          'LinearRegression': lr_model,
          'ExtraTrees': et_model,
          }

y = base_airbnb_cod['price']
X = base_airbnb_cod.drop('price', axis=1)
  • Data Splitting into Training and Testing + Model Training

This step is crucial. Artificial Intelligence learns from training.

Basically, I separate the data into a training set and a testing set, with the training set usually being the larger one (for example, 90% for training and 10% for testing). The call below does not pass test_size, so train_test_split uses its default of 25% for testing and 75% for training.

Next, the training data is given to the model so that it can analyze it and learn how to predict prices.

Once the model has learned, it's necessary to evaluate its performance by testing it with the testing data. It's possible to determine the best model by analyzing the results from the testing data.
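
To see the split sizes in practice, here is a sketch on made-up arrays with 100 rows. It compares the default split with an explicit 10% test set. The data and the `_90`/`_10` variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features and target with 100 rows
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Default: no test_size given, so 25% of the rows go to the test set
X_train, X_test, y_train, y_test = train_test_split(X_toy, y_toy, random_state=10)
print(len(X_train), len(X_test))     # 75 25

# Explicit 90% / 10% split
X_train_90, X_test_10, y_train_90, y_test_10 = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=10)
print(len(X_train_90), len(X_test_10))   # 90 10
```

Fixing random_state makes the split reproducible, so different models are compared on exactly the same test rows.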

In [51]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

for model_name, model in models.items():
    #training
    model.fit(X_train, y_train)
    #testing
    prediction = model.predict(X_test)
    print(evaluate_model(model_name, y_test, prediction))
Model RandomForest:
R²:96.87%
RMSE:46.88
Model LinearRegression:
R²:33.26%
RMSE:216.61
Model ExtraTrees:
R²:97.41%
RMSE:42.68
  • Model Chosen as Best Model: ExtraTreesRegressor

    This was the model with the highest R² value and at the same time the lowest RMSE value. Since we did not have a significant difference in training and prediction speed between this model and the RandomForest model (which had similar R² and RMSE results), we will choose the ExtraTrees Model.

    The linear regression model did not yield satisfactory results, with R² and RMSE values much worse than the other 2 models.

    • Evaluation Metric Results for the Winning Model: ExtraTrees Model:
      R²: 97.41%
      RMSE: 42.68

Adjustments and Improvements to the Best Model¶

In [58]:
#print(et_model.feature_importances_)
#print(X_train.columns)
importance_features = pd.DataFrame(et_model.feature_importances_,X_train.columns)
importance_features = importance_features.sort_values(by = 0, ascending = False)
display(importance_features)
plt.figure(figsize=(15, 5))
ax = sns.barplot(x=importance_features.index, y=importance_features[0])
ax.tick_params(axis = 'x', rotation = 90)
0
bedrooms 0.124351
latitude 0.092095
longitude 0.086285
extra_people 0.082756
n_amenities 0.074254
bathrooms 0.067721
number_of_reviews 0.067232
room_type_Entire home/apt 0.064895
accommodates 0.062930
minimum_nights 0.062276
beds 0.046582
host_listings_count 0.036521
instant_bookable 0.021183
cancellation_policy_flexible 0.018356
property_type_Apartment 0.012602
cancellation_policy_moderate 0.011831
host_is_superhost 0.010735
year 0.010714
cancellation_policy_strict_14_with_grace_period 0.008723
property_type_House 0.006926
property_type_Condominium 0.004862
month 0.004443
room_type_Private room 0.003599
bed_type_Real Bed 0.002536
bed_type_Others 0.002490
property_type_Others 0.002241
property_type_Loft 0.002232
property_type_Serviced apartment 0.002136
room_type_Shared room 0.001870
property_type_Bed and breakfast 0.001268
property_type_Guesthouse 0.000893
cancellation_policy_strict 0.000878
property_type_Guest suite 0.000625
property_type_Hostel 0.000618
room_type_Hotel room 0.000341
is_business_travel_ready 0.000000

Final model adjustments¶

  • is_business_travel_ready has an importance of 0 in the table above, so it has no measurable impact on the model. I'm removing this feature to simplify the model and check whether the results change.
In [62]:
base_airbnb_cod = base_airbnb_cod.drop('is_business_travel_ready', axis=1)

y = base_airbnb_cod['price']
X = base_airbnb_cod.drop('price', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

et_model.fit(X_train, y_train)
prediction = et_model.predict(X_test)
print(evaluate_model('ExtraTrees', y_test, prediction))
Model ExtraTrees:
R²:97.41%
RMSE:42.70

There's almost no impact on the metrics, so I'll keep the model without this feature, since fewer features let it run faster.

Before:
R²: 97.41%
RMSE: 42.68

Now:
Model ExtraTrees:
R²:97.41%
RMSE:42.70

In [64]:
test_base = base_airbnb_cod.copy()
# Drop every dummy column derived from bed_type
bed_type_columns = [column for column in test_base.columns if 'bed_type' in column]
test_base = test_base.drop(columns=bed_type_columns)
print(test_base.columns)
y = test_base['price']
X = test_base.drop('price', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

et_model.fit(X_train, y_train)
prediction = et_model.predict(X_test)
print(evaluate_model('ExtraTrees', y_test, prediction))
Index(['host_is_superhost', 'host_listings_count', 'latitude', 'longitude',
       'accommodates', 'bathrooms', 'bedrooms', 'beds', 'price',
       'extra_people', 'minimum_nights', 'number_of_reviews',
       'instant_bookable', 'year', 'month', 'n_amenities',
       'property_type_Apartment', 'property_type_Bed and breakfast',
       'property_type_Condominium', 'property_type_Guest suite',
       'property_type_Guesthouse', 'property_type_Hostel',
       'property_type_House', 'property_type_Loft', 'property_type_Others',
       'property_type_Serviced apartment', 'room_type_Entire home/apt',
       'room_type_Hotel room', 'room_type_Private room',
       'room_type_Shared room', 'cancellation_policy_flexible',
       'cancellation_policy_moderate', 'cancellation_policy_strict',
       'cancellation_policy_strict_14_with_grace_period'],
      dtype='object')
Model ExtraTrees:
R²:97.39%
RMSE:42.86

Just like in the previous analysis, there were hardly any changes to the model, and yet it was possible to remove more features to make the process faster. Therefore, I will keep the model as it is.

Before:
R²: 97.41%
RMSE: 42.70

Now:
Model ExtraTrees:
R²:97.39%
RMSE:42.86

Project Deploy¶

In [67]:
X['price'] = y
X.to_csv('data.csv')
In [68]:
import joblib
joblib.dump(et_model, 'model.joblib')
Out[68]:
['model.joblib']

By using joblib, it was possible to create a file that contains the entire trained model, so it does not need to be retrained each time it is loaded. This is essential for building the website that will predict property prices.
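
For reference, this is roughly how the saved file could be read back later. The sketch below trains a tiny model on made-up data and uses a temporary folder, so it does not touch the real model.joblib. The toy data and variable names are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Made-up training data: 50 rows, 3 numeric features
rng = np.random.default_rng(0)
X_toy = rng.random((50, 3))
y_toy = X_toy @ np.array([100.0, 50.0, 10.0])

model = ExtraTreesRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'model.joblib')
    joblib.dump(model, path)        # save the trained model
    restored = joblib.load(path)    # load it back without retraining

# The restored model gives the same predictions as the original one
same = np.allclose(model.predict(X_toy), restored.predict(X_toy))
print(same)
```

When predicting with the loaded model, the input needs the same columns, in the same order, as the data used for training.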
